Dataset: https://archive.ics.uci.edu/ml/datasets/online+retail
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, metrics, cluster
%matplotlib inline
Load data
In [2]:
df = pd.read_excel("/data/Online Retail.xlsx")
df.head()
Out[2]:
Find out if there are any NaN values in the columns
In [3]:
df.info()
In [4]:
df.isna().any()
Out[4]:
Drop all records with a NaN CustomerID
In [5]:
df = df[~df.CustomerID.isna()]
df.info()
Find number of unique values for each column
In [6]:
df.nunique()
Out[6]:
Use describe to find range of each column
In [7]:
df.describe()
Out[7]:
We see the Quantity column has a large negative minimum. Negative quantities are probably returns or cancellations and are not valid for this analysis. Drop the records with negative values in Quantity, then verify that no negative values remain in the columns.
In [8]:
df = df[df.Quantity>0]
df.info()
In [9]:
df.describe()
Out[9]:
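To double-check that no negative values remain in any numeric column, a quick sketch (not part of the original flow):
In [ ]:
(df.select_dtypes(include="number") < 0).any()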
In [10]:
df.head()
Out[10]:
Convert the InvoiceDate to datetime.
In [11]:
df.InvoiceDate = pd.to_datetime(df.InvoiceDate)
In [12]:
df.head()
Out[12]:
Create a calculated field to compute TotalPrice
In [13]:
df["TotalPrice"] = df.Quantity * df.UnitPrice
df.head()
Out[13]:
For calculating recency, use the maximum InvoiceDate as the point of reference.
In [14]:
last_date = df.InvoiceDate.max()
last_date
Out[14]:
Calculate the R-F-M.
In [15]:
rfm = df.groupby("CustomerID").agg({
    "InvoiceDate": lambda values: (last_date - values.max()).days,  # recency: days since last purchase
    "InvoiceNo": lambda values: len(values),                        # frequency: number of purchase records (invoice lines)
    "TotalPrice": lambda values: np.sum(values)                     # monetary: total amount spent
})
rfm.head()
Out[15]:
Rename the columns - "recency", "frequency", "monetary"
In [16]:
rfm.columns = ["recency", "frequency", "monetary"]
rfm.head()
Out[16]:
Digitize the R-F-M columns into 5 buckets of equal size. To achieve this, use the 20th through 100th percentile values as bucket boundaries; these yield 5 buckets of roughly equal size.
In [17]:
quantiles = np.arange(1, 6) * 20
quantiles
Out[17]:
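As a toy illustration (values invented for this sketch, not taken from the dataset) of how np.digitize assigns bucket indices with right=True:
In [ ]:
vals = np.array([1, 5, 10, 50, 100])
bins = np.percentile(vals, quantiles)     # boundaries at the 20th..100th percentiles
np.digitize(vals, bins=bins, right=True)  # -> array([0, 1, 2, 3, 4]), one bucket index per value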
In [18]:
rfm["r_score"] = np.digitize(rfm.recency, bins = np.percentile(rfm.recency, quantiles)
, right=True)
rfm["m_score"] = np.digitize(rfm.monetary, bins = np.percentile(rfm.monetary, quantiles)
, right=True)
rfm["f_score"] = np.digitize(rfm.frequency, bins = np.percentile(rfm.frequency, quantiles)
, right=True)
rfm["r_score"] = 4 - rfm["r_score"]
rfm["r_score"] = rfm["r_score"] + 1
rfm["f_score"] = rfm["f_score"] + 1
rfm["m_score"] = rfm["m_score"] + 1
rfm.head()
Out[18]:
In [19]:
rfm.sample(10, random_state=123)
Out[19]:
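Standardize the scores so that each of the three dimensions contributes equally to the Euclidean distances KMeans minimizes.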
In [20]:
scaler = preprocessing.StandardScaler()
X = rfm[["r_score", "f_score", "m_score"]].values
X = scaler.fit_transform(X.astype("float32"))
X
Out[20]:
Find an optimal number of clusters using an elbow plot. As the plot below shows, 5 or 6 clusters (K) is a reasonable choice for the KMeans algorithm.
In [21]:
inertias = {}
for k in range(2, 10):
    kmeans = cluster.KMeans(n_clusters=k, random_state=1)
    kmeans.fit(X)
    inertias[k] = kmeans.inertia_
pd.Series(inertias).plot()
plt.xlabel("K (num of clusters)")
plt.ylabel("Inertia Score")
Out[21]:
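As a cross-check on the choice of K (not part of the original flow), the silhouette score can be computed over the same range; higher values indicate better-separated clusters:
In [ ]:
for k in range(2, 10):
    labels = cluster.KMeans(n_clusters=k, random_state=1).fit_predict(X)
    print(k, metrics.silhouette_score(X, labels))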
In [22]:
k = 5
kmeans = cluster.KMeans(n_clusters=k, random_state = 1)
rfm["cluster"] = kmeans.fit_predict(X)
In [23]:
rfm.cluster.value_counts()
Out[23]:
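Compute each customer's Euclidean distance to its cluster centroid; a smaller distance means the customer is more typical of its segment.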
In [24]:
rfm["distance"] = 0.0
for i in range(k):
centroid = kmeans.cluster_centers_[i].reshape(1, -1)
cluster_points = X[rfm.cluster == i]
rfm["distance"][rfm.cluster == i] = metrics.euclidean_distances(centroid, cluster_points).flatten()
rfm.sample(20)
Out[24]:
In [25]:
rfm.groupby("cluster").distance.agg(["mean", "count"])
Out[25]:
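To interpret the segments, a natural follow-up (a sketch, not part of the original notebook) is to profile each cluster by its average raw R-F-M values:
In [ ]:
rfm.groupby("cluster")[["recency", "frequency", "monetary"]].mean()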
In [ ]: